# Load libraries for EDA
library('tidyverse') # Data wrangling & visualization
library('gt') # Create professional tables
library('janitor') # Clean column names & messy data
library('psych') # Descriptive stats & psych research tools
library('stringr') # String manipulation
library('lubridate') # Handle dates & times
library('rmarkdown') # Render R Markdown docs
library('dplyr') # Data manipulation (filter, select, mutate, etc.)
library('skimr') # Quick summary stats
library('tidyr') # Reshape/tidy data
library('readxl') # Read Excel files
library('ggplot2') # Data visualization
library('readr') # Read/write CSV & text files
library('knitr') # Generate dynamic reports
library('leaflet') # Generate dynamic map
setwd("/home/adam-bushman/All Projects/delivery-standardization-group")
# Read files into session
transactions <- as.data.frame(data.table::fread("data/original/transactional_data.csv"))
customer_address <- read.csv("data/original/customer_address_and_zip_mapping.csv")
customer_profile <- read.csv("data/original/customer_profile.csv")
delivery_cost <- readxl::read_xlsx('data/original/delivery_cost_data.xlsx')
Exploratory Data Analysis | Swire Coca-Cola Capstone Project
IS 6813-001, Spring 2025 | Group 3
Business Problem Statement
Regional beverage bottler Swire Coca-Cola (SCCU) relies on two business models: 1) “Red Truck”, which features high-volume customers serviced personally by Swire, and 2) “White Truck” or “Alternate Route to Market”, which entails smaller customers serviced by a third-party distributor.
Swire’s current segmenting strategy has led to misallocation of resources, inflated expenses, and missed opportunities from clients with high-growth potential. Swire aims to better align their customers with the business proposition of these models by identifying customer characteristics and ordering behavior that better determine the right business model for the long-term relationship.
EDA Introduction
In this exploratory data analysis (EDA) notebook, we assess the quality of the provided data, detail the data cleaning processes undertaken, define our success metrics, and outline the key questions driving our analysis. By examining customer characteristics and ordering behaviors, we aim to realign Swire’s customer segmentation with the strategic propositions of their two distribution models, ultimately enabling a more effective long-term relationship strategy.
File Investigation
Libraries & Data
We begin by setting up our session with the necessary libraries and data provided by Swire. These will be referenced often throughout this document.
Files from Swire
Swire Coca-Cola provided us with four (4) files we can use for the project. These include:
| Name | Description | Format | Columns | Rows |
|---|---|---|---|---|
| transactional_data | This dataset records detailed transactional information, including order quantities and delivery metrics. | .csv | 11 | 1,045,540 |
| customer_address_and_zip_mapping | This dataset maps ZIP codes to full address information. | .csv | 2 | 1,801 |
| customer_profile | This dataset provides detailed information about customers, including onboarding and purchasing behavior. | .csv | 11 | 30,478 |
| delivery_cost_data | This dataset describes the median 'per gallon/case' cost of delivery according to annual order volume and customer profile. | .xlsx | 5 | 160 |
Let’s take a peek at each of these in turn. We’ll remark on:
- Completeness (how many data are missing)
- Data types (are fields properly formatted for analysis)
- Distribution (how values are spread out)
- Potential wrangling (what we can do for improving data set quality)
transactional_data.csv
skim(transactions)
| Name | transactions |
| Number of rows | 1045540 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| TRANSACTION_DATE | 0 | 1 | 8 | 10 | 0 | 723 | 0 |
| ORDER_TYPE | 0 | 1 | 3 | 13 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| WEEK | 0 | 1 | 26.23 | 14.52 | 1.0 | 14 | 26 | 38.00 | 52.00 | ▇▇▇▇▇ |
| YEAR | 0 | 1 | 2023.50 | 0.50 | 2023.0 | 2023 | 2023 | 2024.00 | 2024.00 | ▇▁▁▁▇ |
| CUSTOMER_NUMBER | 0 | 1 | 546643776.32 | 49426585.56 | 500245678.0 | 501091920 | 501548213 | 600080939.00 | 600975408.00 | ▇▁▁▁▇ |
| ORDERED_CASES | 0 | 1 | 26.85 | 126.76 | 0.0 | 0 | 7 | 18.50 | 8479.89 | ▇▁▁▁▁ |
| LOADED_CASES | 0 | 1 | 25.92 | 122.79 | 0.0 | 0 | 7 | 18.00 | 8171.56 | ▇▁▁▁▁ |
| DELIVERED_CASES | 0 | 1 | 25.13 | 121.52 | -3132.0 | 0 | 6 | 17.33 | 8069.48 | ▁▇▁▁▁ |
| ORDERED_GALLONS | 0 | 1 | 9.87 | 26.47 | 0.0 | 0 | 0 | 12.50 | 2562.50 | ▇▁▁▁▁ |
| LOADED_GALLONS | 0 | 1 | 9.60 | 25.65 | 0.0 | 0 | 0 | 12.50 | 2562.50 | ▇▁▁▁▁ |
| DELIVERED_GALLONS | 0 | 1 | 9.21 | 25.18 | -1792.5 | 0 | 0 | 12.50 | 2292.50 | ▁▁▇▁▁ |
glimpse(transactions)
Rows: 1,045,540
Columns: 11
$ TRANSACTION_DATE <chr> "1/5/2023", "1/6/2023", "1/9/2023", "1/11/2023", "1/…
$ WEEK <int> 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 9…
$ YEAR <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023…
$ CUSTOMER_NUMBER <int> 501202893, 500264574, 501174701, 600586532, 50101432…
$ ORDER_TYPE <chr> "MYCOKE LEGACY", "MYCOKE LEGACY", "MYCOKE LEGACY", "…
$ ORDERED_CASES <dbl> 1.0, 12.5, 2.0, 18.0, 29.0, 1.5, 6.0, 0.0, 4.0, 18.0…
$ LOADED_CASES <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0, 17.0…
$ DELIVERED_CASES <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0, 17.0…
$ ORDERED_GALLONS <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000000, 0…
$ LOADED_GALLONS <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000000, 0…
$ DELIVERED_GALLONS <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000000, 0…
Completeness
The data are fully complete; there’s not a single column with missing or empty values.
Data Types
We have a fair mix of data types, though some are improperly cast. Identifiers like CUSTOMER_NUMBER shouldn’t be continuous, and dates like TRANSACTION_DATE can be enriched with a date/datetime format.
Distribution
The shape of the data is quite skewed: nearly all of the volume columns are heavily right-skewed (meaning most transactions are small). We also see some “negative” delivery volumes, which we understand to be returns.
Potential Wrangling
This data set will benefit most from proper data types. Additionally, we can determine (for each customer) how often they see returned transactions (a potentially helpful indicator for client quality).
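As a quick sanity check on the “returns” interpretation, a sketch like the following (using the column names from the glimpse above) would tally how many transactions carry a negative delivered volume:

```r
library(dplyr)

# Sketch: count and share of transactions with any negative delivered
# volume, which we interpret as returns
transactions |>
  summarise(
    n_returns = sum(DELIVERED_CASES < 0 | DELIVERED_GALLONS < 0),
    pct_returns = mean(DELIVERED_CASES < 0 | DELIVERED_GALLONS < 0)
  )
```

A per-customer version of this count is derived later in the wrangling as `return_frequency`.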
We don’t yet have a good sense for 1) the breadth of data longitudinally and 2) what order types exist. Let’s solve these with some basic visualizations.
# Prepare and plot the daily transaction volume over time
transactions |>
# Convert the TRANSACTION_DATE column from character to Date type
mutate(
TRANSACTION_DATE = lubridate::mdy(TRANSACTION_DATE) # mdy() parses dates in month-day-year format
) |>
# Group by transaction date and count number of transactions per day
group_by(TRANSACTION_DATE) |>
count() |>
ungroup() |> # Remove grouping to simplify plotting
# Create a time series line plot of daily transaction counts
ggplot(
aes(TRANSACTION_DATE, n) # Map x to date, y to daily count
) +
geom_line(color = swire_colors$blue) + # Line plot of transaction volume
geom_smooth(color = swire_colors$red) + # Smoothed trend line for easier interpretation
scale_x_date() + # Format the x-axis for dates
labs(
title = "Daily Transaction Volume", # Title of the plot
x = "Transaction Date" # Label for the x-axis
) +
theme_swire() + # Apply custom ggplot theme
theme(
axis.title.y = element_blank() # Remove y-axis title for cleaner look
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There’s obviously high variability in the number of transactions per day, but the smoothed trend shows the average daily transaction volume ranging between 1,000 and 2,000. Additionally, it’s clear we have two (2) years of transactions: 2023 & 2024.
# Summarize the number and percentage of transactions by ORDER_TYPE
order_types <-
transactions |>
group_by(ORDER_TYPE) |>
count() |>
ungroup() |>
mutate(
perc = n / sum(n) # Calculate percent share for each order type
) |>
arrange(n) # Order from smallest to largest count
# Reorder factor levels to match ascending order of counts for plotting
order_types$ORDER_TYPE = factor(order_types$ORDER_TYPE, levels = order_types$ORDER_TYPE)
# Create a horizontal bar chart showing the distribution of ORDER_TYPE
ggplot(
order_types,
aes(n, ORDER_TYPE, label = scales::percent_format()(perc)) # Format labels as percentages
) +
geom_col(fill = swire_colors$red) + # Red bars for each order type
geom_text(
aes(
hjust = ifelse(n < 1e5, -0.1, 1.1), # Adjust horizontal text position based on bar length
color = ifelse(n < 1e5, "black", "white") # Use black text for short bars, white for long
)
) +
scale_color_identity() + # Use the raw color values from the data without scaling
labs(
title = "Distribution of `ORDER_TYPE`" # Plot title
) +
theme_swire() + # Apply custom ggplot theme
theme(
axis.title = element_blank(), # Remove axis titles
axis.text.x = element_blank(), # Hide x-axis text
axis.ticks.x = element_blank() # Hide x-axis ticks
  )
Okay, we now see there are six unique order types plus some values that indicate null. That appears to be the literal text “null”, so it’s an opportunity to clean up. Three order types are of largest interest:
- MYCOKE LEGACY
- CALL CENTER
- SALES REP
customer_address_and_zip_mapping.csv
skim(customer_address)
| Name | customer_address |
| Number of rows | 1801 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| full.address | 0 | 1 | 45 | 73 | 0 | 1801 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zip | 0 | 1 | 28919.81 | 25588.64 | 1001 | 2153 | 21634 | 42440 | 71483 | ▇▃▅▁▅ |
glimpse(customer_address)
Rows: 1,801
Columns: 2
$ zip <int> 71018, 71021, 71023, 71024, 71039, 71055, 71058, 71071, 7…
$ full.address <chr> "71018,Cotton Valley,Louisiana,LA,Webster,119,32.819,-93.…
Completeness
We aren’t seeing any incomplete/missing values so far here. That’s great news.
Data Types
The data types aren’t quite right. Because of leading zeros and 9 digit zip codes, zip should really be of type “character”. Also, we’re seeing the full address is comma-separated; not ideal.
Distribution
There’s no field where assessing distribution is helpful/relevant.
Potential Wrangling
Pulling out the values from full.address will be helpful. We can then make sure those are of proper data types.
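Since ZIP codes are the likely join key to the customer profile, a minimal sketch of the type fix — casting `zip` to character and padding to five digits so leading zeros survive — might look like:

```r
library(dplyr)
library(stringr)

# Sketch: pad ZIPs to five characters so leading zeros survive
# (e.g. Massachusetts ZIPs such as 1001 become "01001")
customer_address |>
  mutate(zip = str_pad(zip, width = 5, side = "left", pad = "0"))
```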
One thing we don’t have a great sense for right now is where the addresses are located. Let’s see if we can’t make a dynamic map showing where customers are located:
Swire Market Geography
We’ll first pull out the latitude/longitude values.
# Split the `full.address` column into multiple geographic components
addr_exp <-
customer_address |>
separate(
full.address,
into = c("zip", "city", "state", "state_abbr", "county", "region", "lat", "lon"), # New columns to create
sep = ",", # Split on commas
convert = TRUE # Automatically convert resulting columns to appropriate types (e.g., numeric for lat/lon)
  )
Then we’ll add them to a map:
# Create an interactive leaflet map showing customer address locations
swire_map <-
leaflet(height = 800, width = 800) |> # Initialize leaflet map with set dimensions
addTiles() |> # Add default OpenStreetMap tile layer as a base
setView(
lng = mean(addr_exp$lon), # Center map horizontally at average longitude
lat = mean(addr_exp$lat), # Center map vertically at average latitude
zoom = 5 # Set initial zoom level for a broad regional view
) |>
addProviderTiles("CartoDB.Positron") |> # Add light-themed basemap from CartoDB
addCircleMarkers(
data = addr_exp, # Use the expanded address dataset
lng = ~lon, # Longitude from data
lat = ~lat, # Latitude from data
radius = 6, # Marker size
color = swire_colors$red, # Marker color using custom palette
fillOpacity = 0.75 # Transparency for better visibility when overlapping
)
# Render the map
swire_map
Okay, we’re seeing customers largely in Kansas, Kentucky, Maryland, and Massachusetts, with some in Louisiana. It’s not yet understood to what extent, if any, this geography will hold value in a more optimized delivery model solution for Swire, but we’ll ensure the data are wrangled properly regardless.
customer_profile.csv
skim(customer_profile)
| Name | customer_profile |
| Number of rows | 30478 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| logical | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| FREQUENT_ORDER_TYPE | 0 | 1 | 3 | 13 | 0 | 6 | 0 |
| FIRST_DELIVERY_DATE | 0 | 1 | 8 | 10 | 0 | 2401 | 0 |
| ON_BOARDING_DATE | 0 | 1 | 8 | 10 | 0 | 6487 | 0 |
| COLD_DRINK_CHANNEL | 0 | 1 | 5 | 13 | 0 | 9 | 0 |
| TRADE_CHANNEL | 0 | 1 | 6 | 28 | 0 | 26 | 0 |
| SUB_TRADE_CHANNEL | 0 | 1 | 4 | 27 | 0 | 48 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| LOCAL_MARKET_PARTNER | 0 | 1 | 0.90 | TRU: 27355, FAL: 3123 |
| CO2_CUSTOMER | 0 | 1 | 0.39 | FAL: 18496, TRU: 11982 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CUSTOMER_NUMBER | 0 | 1.0 | 538301800.92 | 47950644.47 | 500245678 | 501164306 | 501573994 | 600075795 | 600975408 | ▇▁▁▁▅ |
| PRIMARY_GROUP_NUMBER | 18196 | 0.4 | 2779.85 | 2608.64 | 4 | 444 | 1892 | 4488 | 9999 | ▇▃▂▁▁ |
| ZIP_CODE | 0 | 1.0 | 30252.25 | 25953.08 | 1001 | 2155 | 21771 | 42762 | 71483 | ▇▃▅▁▆ |
glimpse(customer_profile)
Rows: 30,478
Columns: 11
$ CUSTOMER_NUMBER <int> 501556470, 501363456, 600075150, 500823056, 60008…
$ PRIMARY_GROUP_NUMBER <int> 376, NA, 2158, 2183, 1892, NA, 9996, NA, NA, 8803…
$ FREQUENT_ORDER_TYPE <chr> "MYCOKE LEGACY", "SALES REP", "SALES REP", "OTHER…
$ FIRST_DELIVERY_DATE <chr> "1/2/2024", "4/14/2022", "3/4/2016", "2/6/2019", …
$ ON_BOARDING_DATE <chr> "8/28/2023", "3/22/2022", "3/22/2012", "11/23/201…
$ COLD_DRINK_CHANNEL <chr> "DINING", "DINING", "DINING", "DINING", "PUBLIC S…
$ TRADE_CHANNEL <chr> "FAST CASUAL DINING", "COMPREHENSIVE DINING", "FA…
$ SUB_TRADE_CHANNEL <chr> "PIZZA FAST FOOD", "FSR - MISC", "OTHER FAST FOOD…
$ LOCAL_MARKET_PARTNER <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, …
$ CO2_CUSTOMER <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRU…
$ ZIP_CODE <int> 21664, 1885, 67073, 1885, 1203, 67952, 40517, 156…
Completeness
All columns appear complete with the exception of PRIMARY_GROUP_NUMBER; approximately 60% of records are missing values here. This is interpreted as customers who are “stand alone”, as in not part of a corporate franchising arm. Therefore, it’s understandable these values are missing.
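A one-liner confirms that share against the skim output above:

```r
# Sketch: share of customers with no primary group (i.e. stand-alone customers)
mean(is.na(customer_profile$PRIMARY_GROUP_NUMBER))  # ~0.60 per the skim above
```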
Data Types
Data types are largely appropriate. We have several “character” types we can turn into factors. Some dates should be appropriately cast. We do need to change identifiers to “character”. And similar processing as already described will be done for ZIP_CODE.
Distribution
There’s not a ton to note here except that the vast majority of customers are “local market partners” and most aren’t purchasing CO2 from Swire, either because they source it elsewhere or are not fountain customers.
Potential Wrangling
It’s likely that a customer belonging to a franchise would be evaluated differently than a stand alone customer, all else being equal. We can derive a value for the number of franchises belonging to the primary group.
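A minimal sketch of that derivation, counting sibling customers under each primary group:

```r
library(dplyr)

# Sketch: number of customers (franchisees) per primary group,
# largest groups first; stand-alone customers (NA) excluded
customer_profile |>
  filter(!is.na(PRIMARY_GROUP_NUMBER)) |>
  count(PRIMARY_GROUP_NUMBER, name = "group_size", sort = TRUE)
```

The full wrangling pipeline later computes a similar per-year field, `primary_group_customers`.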
Let’s see what values/frequency is inherent to COLD_DRINK_CHANNEL:
# Summarize the number and percentage of customers by COLD_DRINK_CHANNEL
channel <-
customer_profile |>
group_by(COLD_DRINK_CHANNEL) |>
count() |>
ungroup() |>
mutate(
perc = n / sum(n) # Calculate percent share for each channel
) |>
arrange(n) # Order from smallest to largest count
# Reorder factor levels so bars display in ascending order
channel$COLD_DRINK_CHANNEL = factor(channel$COLD_DRINK_CHANNEL, levels = channel$COLD_DRINK_CHANNEL)
# Create a horizontal bar chart showing distribution of COLD_DRINK_CHANNEL
ggplot(
channel,
aes(n, COLD_DRINK_CHANNEL, label = scales::percent_format()(perc)) # Show percentage labels
) +
geom_col(fill = swire_colors$red) + # Red bars for each channel
geom_text(
aes(
hjust = ifelse(n < 10000, -0.1, 1.1), # Position text inside or outside based on count
color = ifelse(n < 10000, "black", "white") # Ensure text is visible on the background
)
) +
scale_color_identity() + # Use exact color values with no scale transformation
labs(
title = "Distribution of `COLD_DRINK_CHANNEL`" # Plot title
) +
theme_swire() + # Apply custom ggplot theme
theme(
axis.title = element_blank(), # Remove axis titles
axis.text.x = element_blank(), # Hide x-axis text
axis.ticks.x = element_blank() # Hide x-axis ticks
  )
There are nine different channels, largely dominated by “DINING”. Others of interest include “EVENT” and “GOODS”.
delivery_cost_data.xlsx
skim(delivery_cost)
| Name | delivery_cost |
| Number of rows | 160 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Cold Drink Channel | 0 | 1 | 5 | 13 | 0 | 8 | 0 |
| Vol Range | 0 | 1 | 5 | 11 | 0 | 10 | 0 |
| Applicable To | 0 | 1 | 8 | 16 | 0 | 2 | 0 |
| Cost Type | 0 | 1 | 8 | 10 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Median Delivery Cost | 0 | 1 | 2.6 | 1.71 | 0.37 | 1.33 | 2.24 | 3.47 | 8.59 | ▇▆▂▁▁ |
glimpse(delivery_cost)
Rows: 160
Columns: 5
$ `Cold Drink Channel` <chr> "WORKPLACE", "WORKPLACE", "WORKPLACE", "WORKPLA…
$ `Vol Range` <chr> "0 - 149", "150 - 299", "300 - 449", "450 - 599…
$ `Applicable To` <chr> "Bottles and Cans", "Bottles and Cans", "Bottle…
$ `Median Delivery Cost` <dbl> 8.0649504, 4.1656458, 2.9915579, 2.5242219, 2.0…
$ `Cost Type` <chr> "Per Case", "Per Case", "Per Case", "Per Case",…
Completeness
All values are present; there shouldn’t be any imputation necessary.
Data Types
Vol Range is a field that needs to be split out into a “min”/“max”. Everything else seems very reasonable.
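One hedged sketch of that split uses `tidyr::separate()`; the open-ended “1350+” tier has no upper bound, so we strip the “+” and let `max_vol` fall to `NA` (the wrangling section instead substitutes a large sentinel value):

```r
library(dplyr)
library(stringr)
library(tidyr)

# Sketch: split "0 - 149"-style ranges into numeric min/max columns
delivery_cost |>
  mutate(`Vol Range` = str_remove(`Vol Range`, "\\+")) |>  # "1350+" -> "1350"
  separate(`Vol Range`, into = c("min_vol", "max_vol"),
           sep = " - ", fill = "right", convert = TRUE)
```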
Distribution
In most cases, the distribution of Median Delivery Cost is skewed right, which is indicative of outlier costs for troublesome clients.
Potential Wrangling
It’s notable that the column names are formatted differently. This is an area of standardization. Additionally, we’ll need to do quite a bit of prep to get the appropriate annual volumes by customer and intersect with this lookup table.
# Ensure the factor order of Cold Drink Channel matches that used in the earlier chart
delivery_cost$`Cold Drink Channel` = factor(
delivery_cost$`Cold Drink Channel`,
levels = channel$COLD_DRINK_CHANNEL # Use levels from previous summary to ensure consistent ordering
)
# Create faceted boxplots showing delivery cost distribution by channel and product type
ggplot(
delivery_cost,
aes(`Median Delivery Cost`, `Cold Drink Channel`) # Map cost to x-axis, channel to y-axis
) +
geom_boxplot(color = swire_colors$red) + # Red boxplots for visual emphasis
scale_x_continuous(labels = scales::label_currency()) + # Format x-axis as currency
facet_wrap(~`Applicable To`, nrow = 1, scales = "free") + # Create separate plots for each product type
labs(
title = "Distribution of Median Delivery Cost", # Plot title
subtitle = "By `Cold Drink Channel` and `Product Type`" # Clarify the facets
) +
theme_swire() + # Apply custom ggplot theme
theme(
axis.title = element_blank() # Remove axis titles for cleaner appearance
  )
Interestingly, there’s some real variance among costs in some Cold Drink Channels but less in others. Additionally, it’s not consistent between “Bottles and Cans” and “Fountain” products.
Data Wrangling
Having explored the files individually and noted important details, we can now move on to wrangling these data into a single object, suitable for the more in-depth analysis upcoming. This wrangling will comprise:
- Prep individual files for level-of-detail
- Combining the four (4) individual files we collected
- Standardizing column names
- Developing a theory for ordering said columns
- Casting fields to their proper data types
- Derived columns that further describe customer profiles
- Settle on transactional level data with appropriate cost estimates
- Etc.
Prepping the Cost Data
delivery_cost_expanded <-
delivery_cost |>
# Split the volume range into an object
mutate(
range_obj = purrr::map(`Vol Range`, str_split, " - ")
) |>
# Unnest the object for individual reference
unnest(range_obj) |>
unnest_wider(range_obj, names_sep = "_") |>
# Handle the "1350+" scenario
mutate(
min_vol = purrr::map_chr(range_obj_1, str_replace, "\\+", ""),
max_vol = ifelse(is.na(range_obj_2), (2^31) - 1, range_obj_2)
) |>
  # Turn volumes from characters to integers
mutate(
across(min_vol:max_vol, as.integer)
) |>
# Drop irrelevant columns
  select(-c(range_obj_1, range_obj_2, `Vol Range`))
annual_cust_volume <-
# Take transaction level data
transactions |>
# Bring in the customer profile for the `Cold Drink Channel`
inner_join(
customer_profile,
join_by(CUSTOMER_NUMBER)
) |>
# Get annual cases/gallons by customer
group_by(YEAR, CUSTOMER_NUMBER, COLD_DRINK_CHANNEL) |>
summarise(
annual_cases = sum(DELIVERED_CASES),
annual_gallons = sum(DELIVERED_GALLONS),
.groups = "drop"
  )
# Join delivery cost estimates to each customer's annual volume to assign appropriate tiered rates
delivery_cost_tiers <-
annual_cust_volume |>
# Join delivery costs for *case-based* products (excluding Fountain)
left_join(
delivery_cost_expanded |>
filter(`Applicable To` != 'Fountain'), # Filter to non-fountain products
join_by(
COLD_DRINK_CHANNEL == `Cold Drink Channel`, # Match by drink channel
annual_cases >= min_vol, # Ensure volume falls within the defined min
annual_cases <= max_vol # Ensure volume falls within the defined max
)
) |>
# Join delivery costs for *gallon-based* products (only Fountain)
left_join(
delivery_cost_expanded |>
filter(`Applicable To` == 'Fountain'), # Filter only fountain records
join_by(
COLD_DRINK_CHANNEL == `Cold Drink Channel`, # Match by drink channel
annual_gallons >= min_vol, # Match gallon volume to range
annual_gallons <= max_vol
),
suffix = c(".c", ".g") # Add suffixes to distinguish overlapping column names
) |>
# Select only the relevant fields for output
select(
YEAR, CUSTOMER_NUMBER,
case_delivery_cost = `Median Delivery Cost.c`, # Cost per case delivery
gallon_delivery_cost = `Median Delivery Cost.g` # Cost per gallon delivery
)
# Take a peek
head(delivery_cost_tiers)
# A tibble: 6 × 4
YEAR CUSTOMER_NUMBER case_delivery_cost gallon_delivery_cost
<int> <int> <dbl> <dbl>
1 2023 500245678 4.47 2.68
2 2023 500245685 8.59 2.01
3 2023 500245686 7.33 4.62
4 2023 500245687 5.59 3.76
5 2023 500245689 8.59 2.53
6 2023 500245690 8.59 3.02
Prep the Customer Address Object
cust_addr_expanded <-
customer_address |>
# Split the full address into an object
mutate(
addr_obj = purrr::map(full.address, str_split, ",")
) |>
# Unnest the object for individual reference
unnest(addr_obj) |>
unnest_wider(addr_obj, names_sep = "_") |>
# Pad the zip code with leading zeros and make a character
mutate(
zip = str_pad(zip, 5, "left", pad = "0")
) |>
# Rename columns
rename(
city = addr_obj_2,
state = addr_obj_3,
state_abbr = addr_obj_4,
county = addr_obj_5,
lat = addr_obj_7,
lon = addr_obj_8
) |>
# Turn lat/lon values to numeric
mutate(
across(lat:lon, as.numeric)
) |>
# Drop irrelevant columns
  select(-c(full.address, addr_obj_1, addr_obj_6))
Combine Individual Files
combined_data_raw <-
# Take transactions
transactions |>
# Join the customer profile data thereto
inner_join(
customer_profile |> mutate(zip = str_pad(
ZIP_CODE, 5, "left", "0"
)),
join_by(CUSTOMER_NUMBER)
) |>
# Join the customer address data thereto
inner_join(
cust_addr_expanded,
join_by(zip)
) |>
# Join the delivery cost tiers data thereto
inner_join(
delivery_cost_tiers,
join_by(YEAR, CUSTOMER_NUMBER)
  )
Standardize Names & Data Types
combined_data_std <-
# Take the combined data from above
combined_data_raw |>
# Standardize the names
clean_names() |>
# Standardize data types
mutate(
    # Convert character dates to date types
across(c(transaction_date, first_delivery_date, on_boarding_date), lubridate::mdy),
# Turn IDs into characters
across(c(customer_number, primary_group_number), as.character),
# Turn finite categorical fields into factors
across(
c(order_type, cold_drink_channel, frequent_order_type, trade_channel, sub_trade_channel, state, state_abbr),
as.factor
)
) |>
# Remove irrelevant columns
  select(-c(zip_code))
Enrich Dataset with New Fields
swire_data_full <-
combined_data_std |>
# Add new fields
mutate(
# Calculate delivered gallons cost
# Assume a return is only half as costly as a normal delivery
delivered_gallons_cost = case_when(
delivered_gallons < 0 ~ -1 * delivered_gallons * gallon_delivery_cost / 2,
TRUE ~ delivered_gallons * gallon_delivery_cost
),
# Calculate delivered case cost
# Assume a return is only half as costly as a normal delivery
delivered_cases_cost = case_when(
delivered_cases < 0 ~ -1 * delivered_cases * case_delivery_cost / 2,
TRUE ~ delivered_cases * case_delivery_cost
),
# Create 'total' columns representing the sum of gallons & cases
ordered_total = ordered_gallons + ordered_cases,
loaded_total = loaded_gallons + loaded_cases,
delivered_total = delivered_gallons + delivered_cases,
) |>
group_by(year, primary_group_number) |>
mutate(
# Calculate number of customers belonging to each primary group by year
primary_group_customers = ifelse(is.na(primary_group_number), 0, n_distinct(customer_number))
) |>
group_by(year, customer_number) |>
mutate(
# Calculate how often a customer issues a return each year
return_frequency = sum(ifelse(delivered_cases < 0 | delivered_gallons < 0, 1, 0))
) |>
ungroup() |>
# Drop select columns that are no longer relevant
select(-c(gallon_delivery_cost, case_delivery_cost)) |>
# Order the columns logically
select(
# CUSTOMER PROFILE ITEMS
customer_number, primary_group_number, primary_group_customers,
on_boarding_date, first_delivery_date, cold_drink_channel, frequent_order_type, trade_channel, sub_trade_channel, local_market_partner, co2_customer, city, zip, state, state_abbr, county, lat, lon,
# TRANSACTION DETAILS
transaction_date, week, year, order_type,
ordered_cases, loaded_cases, delivered_cases, delivered_cases_cost,
ordered_gallons, loaded_gallons, delivered_gallons, delivered_gallons_cost,
ordered_total, loaded_total, delivered_total,
return_frequency
  )
Final Data Set
glimpse(swire_data_full)
Rows: 1,045,540
Columns: 34
$ customer_number <chr> "501202893", "500264574", "501174701", "600586…
$ primary_group_number <chr> NA, "1894", NA, "8397", "1993", "437", NA, NA,…
$ primary_group_customers <dbl> 0, 47, 0, 49, 78, 651, 0, 0, 0, 0, 28, 0, 0, 8…
$ on_boarding_date <date> 2021-04-02, 2015-12-08, 2021-01-26, 1997-02-2…
$ first_delivery_date <date> 2021-05-07, 2018-03-23, 2021-04-12, 2017-05-0…
$ cold_drink_channel <fct> DINING, WELLNESS, DINING, BULK TRADE, GOODS, G…
$ frequent_order_type <fct> SALES REP, OTHER, MYCOKE360, SALES REP, SALES …
$ trade_channel <fct> COMPREHENSIVE DINING, HEALTHCARE, FAST CASUAL …
$ sub_trade_channel <fct> FSR - MISC, OTHER HEALTHCARE, OTHER FAST FOOD,…
$ local_market_partner <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRU…
$ co2_customer <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
$ city <chr> "Mahaska", "Banner", "Chelmsford", "Bernardsto…
$ zip <chr> "66955", "41603", "01824", "01337", "67473", "…
$ state <fct> Kansas, Kentucky, Massachusetts, Massachusetts…
$ state_abbr <fct> KS, KY, MA, MA, KS, KS, MD, MA, KY, KS, MD, KY…
$ county <chr> "Washington", "Floyd", "Middlesex", "Franklin"…
$ lat <dbl> 39.9845, 37.5707, 42.5911, 42.6952, 39.4194, 3…
$ lon <dbl> -97.3453, -82.6806, -71.3556, -72.5772, -98.69…
$ transaction_date <date> 2023-01-05, 2023-01-06, 2023-01-09, 2023-01-1…
$ week <int> 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7…
$ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023…
$ order_type <fct> MYCOKE LEGACY, MYCOKE LEGACY, MYCOKE LEGACY, S…
$ ordered_cases <dbl> 1.0, 12.5, 2.0, 18.0, 29.0, 1.5, 6.0, 0.0, 4.0…
$ loaded_cases <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0…
$ delivered_cases <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0…
$ delivered_cases_cost <dbl> 8.585482, 56.436259, 17.170965, 49.293328, 134…
$ ordered_gallons <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000…
$ loaded_gallons <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000…
$ delivered_gallons <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000…
$ delivered_gallons_cost <dbl> 129.7426990, 0.0000000, 0.0000000, 7.1199352, …
$ ordered_total <dbl> 91.000000, 12.500000, 2.000000, 20.500000, 29.…
$ loaded_total <dbl> 91.000000, 12.500000, 2.000000, 18.500000, 29.…
$ delivered_total <dbl> 91.000000, 12.500000, 2.000000, 18.500000, 29.…
$ return_frequency <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
Now, we have a single, standardized data set that is enriched, properly formatted, and well-suited to the remaining analysis.
Data Exploration
Fulfillment and Order Frequency
In the table below, I calculated the fulfillment rate for various customers. The fulfillment rate is defined as total delivered cases divided by total ordered cases.
- A fulfillment rate of 1 means the customer receives exactly what they order.
- A fulfillment rate greater than 1 indicates they are being delivered more than they ordered.
It may be worth investigating whether these customers have high return rates or if they consistently receive more than they request. If it’s the latter, this pattern could be an indicator of high sales demand.
# Calculate fulfillment rates by customer
fulfillment_rates <- swire_data_full %>%
group_by(customer_number) %>%
summarise(
total_ordered_cases = sum(ordered_cases, na.rm = TRUE),
total_delivered_cases = sum(delivered_cases, na.rm = TRUE),
fulfillment_rate = ifelse(total_ordered_cases == 0, NA, total_delivered_cases / total_ordered_cases)
) %>%
filter(!is.na(fulfillment_rate)) %>% # Remove rows where fulfillment rate is undefined
arrange(desc(fulfillment_rate))
# Define the customer numbers to keep for table
selected_customers <- c("501261447", "501563326", "501580838", "501686677", "501483916", "600563579",
"501004189", "501004936", "501006776", "501006936", "501007325", "501007477", "501007757")
# Summarizing the fulfillment rates and filtering the selected customers for table
filtered_fulfillment_rates <- swire_data_full %>%
group_by(customer_number) %>%
summarise(
total_ordered_cases = sum(ordered_cases, na.rm = TRUE),
total_delivered_cases = sum(delivered_cases, na.rm = TRUE),
fulfillment_rate = ifelse(total_ordered_cases == 0, NA, total_delivered_cases / total_ordered_cases)
) %>%
filter(customer_number %in% selected_customers) %>% # Keep only the selected customers
arrange(desc(fulfillment_rate))
# Display the filtered table
gt(filtered_fulfillment_rates) |>
tab_header(
title = "Fulfillment Rates for Selected Customers"
)
| Fulfillment Rates for Selected Customers ||||
|---|---|---|---|
| customer_number | total_ordered_cases | total_delivered_cases | fulfillment_rate |
| 501261447 | 3.0 | 18.0 | 6.0000000 |
| 501563326 | 1.0 | 6.0 | 6.0000000 |
| 501580838 | 2.0 | 12.0 | 6.0000000 |
| 501686677 | 1.0 | 6.0 | 6.0000000 |
| 600563579 | 159.0 | 915.0 | 5.7547170 |
| 501483916 | 43.0 | 221.5 | 5.1511628 |
| 501004189 | 50.0 | 50.0 | 1.0000000 |
| 501004936 | 10.5 | 10.5 | 1.0000000 |
| 501006776 | 801.0 | 801.0 | 1.0000000 |
| 501006936 | 52.0 | 52.0 | 1.0000000 |
| 501007325 | 20.0 | 20.0 | 1.0000000 |
| 501007757 | 76.5 | 76.5 | 1.0000000 |
| 501007477 | 113.0 | 109.0 | 0.9646018 |
The plot below indicates that the Travel, Superstore, and Bulk Trade channels have the highest average total ordered, followed by General Activities and Academic Institutions. Other trade channels, including Healthcare, Recreation, and Defense, have lower average total ordered values. This comparison highlights that specific channels tend to have a very high average total ordered, while others are consistently lower.
# Aggregate data to calculate the average total ordered per trade channel
top_trade_channels <- swire_data_full %>%
group_by(trade_channel) %>%
summarise(AVG_TOTAL_ORDERED = mean(ordered_cases + ordered_gallons, na.rm = TRUE)) %>%
arrange(desc(AVG_TOTAL_ORDERED)) %>%
slice_head(n = 20) # Select top 20 trade channels
# Create bar plot
ggplot(top_trade_channels, aes(x = reorder(trade_channel, -AVG_TOTAL_ORDERED),
                               y = AVG_TOTAL_ORDERED, fill = trade_channel)) +
geom_bar(stat = "identity", alpha = 0.7, show.legend = FALSE) + # Bar plot without legend
labs(
title = "Top 20 Trade Channels by Average Total Ordered",
x = "Trade Channel",
y = "Average Total Ordered"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) # Rotate x-axis labels for readability
The graph below illustrates that Cruise, Online Store, and Recreation Park have the highest average total ordered volumes. Additionally, Bulk Trade, Other Travel, and Comprehensive Provider exhibit notable order volumes, further reinforcing their contribution to overall sales.
On the other hand, other sub-trade channels, such as Game Center, Recreation Film, and Fast Food, show comparatively lower average total ordered values. This comparison highlights the significant variation in demand across sub-trade channels, suggesting that order volume trends are strongly influenced by industry-specific purchasing behaviors.
# Aggregate data to calculate the average total ordered per sub-trade channel
top_sub_trade <- swire_data_full %>%
group_by(sub_trade_channel) %>%
summarise(AVG_TOTAL_ORDERED = mean(ordered_cases + ordered_gallons, na.rm = TRUE)) %>%
arrange(desc(AVG_TOTAL_ORDERED)) %>%
slice_head(n = 20) # Select top 20 sub-trade channels
# Create bar plot
ggplot(top_sub_trade, aes(x = reorder(sub_trade_channel, -AVG_TOTAL_ORDERED), y = AVG_TOTAL_ORDERED, fill = sub_trade_channel)) +
geom_bar(stat = "identity",
alpha = 0.7,
show.legend = FALSE) + # Bar plot without legend
labs(title = "Top 20 Sub-Trade Channels by Average Total Ordered", x = "Sub-Trade Channel", y = "Average Total Ordered") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) # Rotate x-axis labels for readability
Order Type
The graph below indicates that higher order volumes primarily come from MyCoke Legacy. This trend could be attributed to the fact that most sales originate from MyCoke Legacy and Sales Rep orders, or it may suggest a potential link between order type and high-growth potential customers.
When analyzing customers with < 400 volume, Call Center is the most common order type, while Sales Rep orders drop to third place. This shift suggests that order type could be a valuable characteristic for identifying high-growth potential customers, particularly among lower-volume segments. Further analysis could help determine whether order type patterns correlate with customer expansion trends.
# Aggregate order data by order type and plot total ordered volume
order_type_summary <- swire_data_full %>%
group_by(order_type) %>%
summarize(total_ordered = sum(ordered_cases + ordered_gallons, na.rm = TRUE))
ggplot(order_type_summary, aes(x = reorder(order_type, -total_ordered), y = total_ordered, fill = order_type)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Total Ordered Cases by Order Type", x = "Order Type", y = "Total Ordered")
# Plot total ordered cases by order type for customers with < 400 gallons per year
swire_data_full %>%
group_by(customer_number, order_type) %>%
summarize(total_gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE)) %>%
filter(total_gallons < 400) %>%
group_by(order_type) %>%
summarize(gallons = sum(total_gallons)) %>%
ggplot(aes(x = reorder(order_type, -gallons), y = gallons, fill = order_type)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Barplot of Ordered Gallons by Order Type (Customers < 400 Gallons)") +
labs(title = "Total Ordered Cases by Order Type", x = "Order Type", y = "Total Cases")
`summarise()` has grouped output by 'customer_number'. You can override using
the `.groups` argument.
Cold Drink Channel
The graphs below indicate that Dining and Goods are the primary cold drink channels for customers with a volume of less than 400 gallons, whereas Bulk Trade and Dining dominate across all customers. This distinction suggests that conducting a separate ‘Dining’ cold drink channel analysis could be valuable in comparing the characteristics and ordering behaviors between customers below and above the 400-gallon threshold.
# Plot total gallons ordered by channel
swire_data_full %>%
group_by(cold_drink_channel) %>%
summarize(
gallons = sum(ordered_gallons + ordered_cases) # Combine gallons and cases into a total volume
) %>%
ggplot(aes(
x = reorder(cold_drink_channel, -gallons), # Reorder bars by descending volume
y = gallons,
fill = cold_drink_channel # Use color to differentiate channels
)) +
geom_bar(stat = "identity") + # Use raw values (not counts) for bar height
theme_minimal() + # Apply clean minimal theme
theme(
axis.text.x = element_text(angle = 45, hjust = 1) # Tilt x-axis labels for readability
) +
ggtitle("Barplot of Ordered Gallons by Cold Drink Channel") +
labs(x = "Cold Drink Channel", y = "Total Gallons")
# Plot total gallons for customers with fewer than 400 gallons ordered
swire_data_full %>%
group_by(customer_number, cold_drink_channel) %>%
summarize(
total_gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE) # Sum volume per customer/channel
) %>%
filter(total_gallons < 400) %>% # Filter to small-volume customers
group_by(cold_drink_channel) %>%
summarize(
gallons = sum(total_gallons) # Aggregate across small-volume customers
) %>%
ggplot(aes(
reorder(cold_drink_channel, -gallons), # Reorder bars by descending volume
y = gallons,
fill = cold_drink_channel
)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
) +
ggtitle("Ordered Gallons by Cold Drink Channel (Customers < 400 Gallons)") +
labs(x = "Cold Drink Channel", y = "Total Gallons")
`summarise()` has grouped output by 'customer_number'. You can override using
the `.groups` argument.
Local Market Partners
The table below presents order volume and unit delta from 2023 to 2024 by customer, segmented by trade channel and sub-trade channel. Among the sub-trade channels, Cruises, Online Stores, Bulk Trade, and Comprehensive Providers exhibit higher unit delta year-over-year, indicating strong growth trends.
Additionally, the Fast Casual Dining trade channel stands out with notable year-over-year increases in order volume. This suggests that Fast Casual Dining customers within the local market partner segment could represent less obvious but high-growth potential opportunities, making them a strategic group to explore further.
# Year over year changes in gallon volume by customer
swire_data_full %>%
filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
group_by(customer_number, year, trade_channel, sub_trade_channel) %>%
summarize(gallons = sum(ordered_gallons + ordered_cases)) %>%
pivot_wider(names_from = year, values_from = gallons) %>%
mutate(unit_delta = (`2024` - `2023`)) %>%
arrange(desc(unit_delta))
`summarise()` has grouped output by 'customer_number', 'year', 'trade_channel'.
You can override using the `.groups` argument.
# A tibble: 16,746 × 6
# Groups: customer_number, trade_channel [16,746]
customer_number trade_channel sub_trade_channel `2023` `2024` unit_delta
<chr> <fct> <fct> <dbl> <dbl> <dbl>
1 500733119 TRAVEL CRUISE 139520 1.69e5 29102
2 600064039 TRAVEL CRUISE 160752 1.88e5 27734.
3 600573998 TRAVEL CRUISE 376588 4.03e5 26077
4 500405618 BULK TRADE BULK TRADE 588 2.60e4 25422
5 501031108 SUPERSTORE ONLINE STORE 38909 6.12e4 22288.
6 600261951 TRAVEL CRUISE 106644 1.27e5 20502
7 600585734 TRAVEL CRUISE 84808 1.03e5 17747
8 600576028 SUPERSTORE ONLINE STORE 29744 4.65e4 16738.
9 501202275 PROFESSIONAL SERV… OTHER PROFESSION… 216 1.67e4 16494
10 501562278 GENERAL COMPREHENSIVE PR… 5858 2.22e4 16343
# ℹ 16,736 more rows
The graphs below illustrate the distribution of order types among local market partners. Notably, the EDI order type is the most dominant among local market partners. However, among customers with a volume of less than 400 gallons, Call Center and MyCoke360/Legacy orders remain the preferred choices.
While this difference may not seem highly significant, the shift in order type preference for customers with < 400 gallons—both in the local market partner segment and across all customers—could indicate that this variable holds some importance in customer behavior.
# For all local market partner customers
swire_data_full %>%
filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
group_by(order_type) %>%
summarize(
gallons = sum(ordered_gallons + ordered_cases) # Combine gallons and cases into total volume
) %>%
ggplot(aes(
x = reorder(order_type, -gallons), # Reorder bars by descending volume
y = gallons,
fill = order_type
)) +
geom_bar(stat = "identity") + # Use actual values for bar height
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1) # Tilt x-axis labels for readability
) +
ggtitle("Barplot of Ordered Gallons by Order Type") +
labs(
title = "Total Ordered Cases by Order Type",
x = "Order Type",
y = "Total Cases"
)
# For customers with < 400 gallons per year
swire_data_full %>%
filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
group_by(customer_number, order_type) %>%
summarize(
total_gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE) # Sum volume by customer/order type
) %>%
filter(total_gallons < 400) %>%
group_by(order_type) %>%
summarize(
gallons = sum(total_gallons) # Aggregate total gallons for small-volume customers
) %>%
ggplot(aes(
x = reorder(order_type, -gallons),
y = gallons,
fill = order_type
)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
) +
ggtitle("Barplot of Ordered Gallons by Order Type (Customers < 400 Gallons)") +
labs(
title = "Total Ordered Cases by Order Type",
x = "Order Type",
y = "Total Cases"
)
`summarise()` has grouped output by 'customer_number'. You can override using
the `.groups` argument.
The visuals below indicate that non-partners generally exhibit higher median total orders and a wider range of order volumes, with some extreme outliers. In contrast, local market partners tend to have lower total orders and a smaller interquartile range, suggesting less variability in their order quantities.
This difference in distribution may reflect distinct ordering behaviors between the two groups, warranting further analysis to understand the factors influencing order volume consistency among local market partners.
# Define threshold for filtering (optional)
NUMBER <- 1000 # Exclude extreme values if needed
# Filter dataset to exclude outliers (optional)
ORDERED_UNDER_NUMBER <- swire_data_full %>%
filter(ordered_gallons + ordered_cases <= NUMBER)
# Create boxplot to compare TOTAL_ORDERED by LOCAL_MARKET_PARTNER status
ggplot(ORDERED_UNDER_NUMBER, aes(x = local_market_partner, y = ordered_gallons + ordered_cases, fill = local_market_partner)) +
geom_boxplot(alpha = 0.6) + # Boxplot with transparency
labs(
title = paste("LOCAL_MARKET_PARTNER vs. TOTAL_ORDERED (Excluding Orders >", NUMBER, ")"),
x = "Local Market Partner Status",
y = "Total Ordered"
) +
theme_minimal()
# Compute average TOTAL_ORDERED for Local Market Partners (TRUE) and Non-Partners (FALSE)
avg_total_order <- swire_data_full %>%
group_by(local_market_partner) %>%
summarise(AVG_TOTAL_ORDERED = mean(ordered_gallons + ordered_cases, na.rm = TRUE))
# Display result
print(avg_total_order)
# A tibble: 2 × 2
local_market_partner AVG_TOTAL_ORDERED
<lgl> <dbl>
1 FALSE 57.6
2 TRUE 32.2
CO2 Customer
CO2 customers are an interesting subset of the data. We can look at customers who purchase fountain products (i.e., “ordered_gallons”) and compare their distribution across CO2 customer status, even adjusting for local market partner status.
swire_data_full |>
group_by(co2_customer, customer_number, local_market_partner) |>
summarise(
total = sum(ordered_gallons) # Sum total gallons per customer
) |>
ungroup() |>
filter(total > 0) |> # Exclude customers with zero volume
ggplot(
aes(x = local_market_partner, total) # Compare total volume by LMP status
) +
geom_boxplot(color = swire_colors$red) + # Red boxplots
scale_y_log10() + # Log scale to account for skewed volume distribution
facet_wrap(~co2_customer) + # Separate plots for CO2 vs non-CO2 customers
labs(
title = "Distribution of Total Volume Among 'Fountain' Products",
subtitle = "Grouped by status as a CO2 customer",
x = "Local Market Partner?"
) +
theme_swire() +
theme(
axis.title.y = element_blank() # Omit y-axis title for cleaner look
)
`summarise()` has grouped output by 'co2_customer', 'customer_number'. You can
override using the `.groups` argument.
Among those ordering fountain products, we really aren’t seeing any measurable difference. There are subtleties in the outliers, but non-CO2 customers are not meaningfully different in order volume. Even adjusting for local_market_partner doesn’t reveal any meaningful pattern.
Customer Tenure
An interesting component of this analysis is the varying levels of tenure: some customers are “new” (their first order falls within the time window) while others have been customers for years. We can look at order volume by tenure (in years) for any patterns.
swire_data_full |>
group_by(customer_number) |>
summarise(
tenure_years = floor(max(as.integer(transaction_date - on_boarding_date) / 365)), # Calculate tenure in full years
total = sum(ordered_total) # Total volume per customer
) |>
ungroup() |>
mutate(
tenure_years = as.factor(tenure_years) # Convert tenure to factor for discrete x-axis
) |>
ggplot(
aes(tenure_years, total) # Compare volume across tenure groups
) +
geom_boxplot(color = swire_colors$red) + # Red boxplots
scale_x_discrete() + # Ensure tenure groups appear as categories
scale_y_log10() + # Log scale for better visibility of spread
labs(
title = "Distribution of Total Volume",
subtitle = "Organized by tenure and order type"
) +
theme_swire()
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Warning: Removed 137 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Predictably, customers with less tenure than our two-year window order less volume than others. There also appear to be many outliers during the initial years of tenure; later on, however, there’s more variability in the IQR. We can expand this further by introducing another variable and simplifying to tenure bins.
swire_data_full |>
group_by(customer_number, order_type) |>
summarise(
tenure_years = floor(max(as.integer(transaction_date - on_boarding_date) / 365)), # Calculate full years of tenure
total = sum(ordered_total) # Total volume per customer/order type
) |>
ungroup() |>
mutate(
tenure_years_bin = as.factor(cut(tenure_years, breaks = 4)) # Bin tenure into 4 equal-width ranges and convert to factor
) |>
ggplot(
aes(tenure_years_bin, total) # Plot volume by tenure bin
) +
geom_boxplot(color = swire_colors$red) + # Red boxplots
scale_y_log10() + # Log scale to account for skew
facet_wrap(~order_type) + # Separate plots by order type
labs(
title = "Distribution of Total Volume",
subtitle = "Organized by tenure and order type"
) +
theme_swire()
`summarise()` has grouped output by 'customer_number'. You can override using
the `.groups` argument.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Warning: Removed 2422 rows containing non-finite outside the scale range
(`stat_boxplot()`).
We see here that there really isn’t any pattern of time/tenure playing into the equation. We could look at the same for cold_drink_channel, and the same holds true. This suggests there’s no general trend of tenure driving increased order volume, even within subgroups.
Geographic Order Volume
We can also get a sense of the concentration of orders by geography. We’ll build off the map we created above, scaling the size of the points to the volume of orders.
swire_cust_vol <-
swire_data_full |>
group_by(customer_number, lon, lat, local_market_partner) |>
summarise(
total = sum(ordered_total) # Total ordered volume per customer
) |>
ungroup() |>
mutate(
total_scaled = (total / max(total)) * 100 # Scale volume to 0–100 range for visualization
)
`summarise()` has grouped output by 'customer_number', 'lon', 'lat'. You can
override using the `.groups` argument.
swire_map_vol <-
leaflet(height = 800, width = 800) |> # Initialize leaflet map with set dimensions
addTiles() |> # Add default OpenStreetMap base tiles
setView(
lng = mean(swire_cust_vol$lon), # Center longitude at the mean customer location
lat = mean(swire_cust_vol$lat), # Center latitude at the mean customer location
zoom = 5 # Set zoom level for a broad regional view
) |>
addProviderTiles("CartoDB.Positron") |> # Use a clean, minimal basemap
addCircleMarkers(
data = swire_cust_vol,
lng = ~lon, # Longitude from dataset
lat = ~lat, # Latitude from dataset
radius = ~sqrt(total_scaled), # Use square root to visually balance the scaled volume
color = ~swire_colors$red, # Use custom red from palette
fillOpacity = 0.15 # Low opacity to reduce visual clutter when markers overlap
)
swire_map_vol
There are certainly hotbeds of activity, but there isn’t a strong pattern visible from this view. Better still would be quantifying density relative to other delivery points and/or to sister franchises. We could use something like the haversine formula as a proxy.
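As a rough sketch of that proxy, the haversine formula can be written as a small base-R helper. The function name, arguments, and use of the mean Earth radius are our own choices for illustration, not part of the pipeline above.

```r
# Haversine great-circle distance in kilometers between two lon/lat points.
# Vectorized over its inputs; r is the mean Earth radius in km.
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))  # clamp guards against floating-point overshoot
}
```

Applied pairwise across the `lon`/`lat` columns of `swire_cust_vol`, this would give each customer a nearest-neighbor distance we could use as a density proxy.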
Conclusion
This exploratory data analysis (EDA) exercise has established a foundation for our group in some key areas:
We are far more familiar with Swire Coca-Cola’s business model, data structure, and customer nuances. This will equip us with necessary context for the modeling stage.
We have a solid understanding of areas to model and explore more deeply given the lessons learned and findings extracted. This means we’ll be more efficient and effective.
Our team has yielded “proof of concept” in how we collaborate cognitively (i.e. exploration of ideas) and structurally. This is crucial as the group looks to deliver a powerful solution to Swire Coca-Cola by project end.
During the next phase of the project, Modeling, our team will build on these insights with more advanced analysis techniques, such as clustering algorithms, cost savings models, and both temporal and spatial analyses, all in an effort to develop a framework that better distinguishes high-growth from low-growth customers and improves on the current 400-gallon threshold.
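The clustering direction can be sketched with base R’s kmeans() on toy per-customer features; the feature names and values here are illustrative stand-ins, not columns from our cleaned data set.

```r
# A minimal k-means sketch of the segmentation direction described above.
# annual_volume and order_frequency are hypothetical features, not
# columns from swire_data_full.
set.seed(42)
toy_customers <- data.frame(
  annual_volume   = c(runif(50, 0, 400), runif(50, 400, 5000)),  # units per year
  order_frequency = c(runif(50, 1, 20),  runif(50, 20, 120))     # orders per year
)

# Standardize so the high-magnitude volume feature doesn't dominate distances
scaled_features <- scale(toy_customers)

# Partition into k = 2 candidate segments (low- vs. high-growth proxies)
km <- kmeans(scaled_features, centers = 2, nstart = 25)
table(km$cluster)
```

In practice we would derive features such as fulfillment rate, tenure, and year-over-year volume delta from swire_data_full, and choose k via an elbow or silhouette diagnostic rather than fixing it at 2.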
Division of Work
Formulation of this group EDA was made possible thanks to the collaborative partnership from all group members.
As a group, we’ve partnered in the following ways:
Worked out of a group GitHub repository, following best practices of commits to branches and issuing pull requests, with teammates reviewing, approving, and merging work to main.
Partnered on data wrangling logic, determining how we would clean, standardize, and resolve inherent constraints of the provided data. This was the recipe for the individual work described below.
Consolidated individual work into this document, ensuring thoughtful organization, engaging content, and attractive design.
Each member contributed with individual work as described below:
Individually, we each created a unique EDA document comprising our own insights through extensive research and time with the data sets.
Individually, we created a cleaning script adhering to the defined plan as described above. This approach helped ensure all gained valuable programming experience and were intimately familiar with our cleaned data.
Thank you,
Adam, Georgia, Tyler, Zac